Similarity-based listing recommendations in a data exchange

ABSTRACT

A set of affinity metrics may be determined for a set of listings, each listing of the set of listings comprising data to be shared through a data exchange, wherein the set of affinity metrics includes a set of characteristics allowing identification of a listing having one or more characteristics in the set of characteristics. For each pair of listings of the set of listings, an affinity score can be calculated, using the set of affinity metrics, and stored as part of a record in an affinity store. One or more listings of the set of listings can be presented using the affinity score between a first listing of the set of listings and the one or more listings of the set of listings.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of co-pending U.S. application Ser. No. 17/872,463, filed Jul. 25, 2022, entitled “Similarity-Based Listing Recommendations In A Data Exchange,” which claims the benefit of U.S. Provisional Application No. 63/351,685 filed on Jun. 13, 2022, and entitled “Similarity-Based Listing Recommendations In A Data Exchange,” and these applications are hereby incorporated by reference herein in their entirety.

TECHNICAL FIELD

The present disclosure relates to data sharing platforms, and particularly to providing similarity-based listing recommendations for a data sharing platform.

BACKGROUND

Databases are widely used for data storage and access in computing applications. Databases may include one or more tables that include or reference data that can be read, modified, or deleted using queries. Databases may be used for storing and/or accessing personal information or other sensitive information. Secure storage and access of database data may be provided by encrypting and/or storing data in an encrypted form to prevent unauthorized access. In some cases, data sharing may be desirable to let other parties perform queries against a set of data.

BRIEF DESCRIPTION OF THE DRAWINGS

The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.

FIG. 1A is a block diagram depicting an example computing environment in which the methods disclosed herein may be implemented, in accordance with some embodiments of the present invention.

FIG. 1B is a block diagram illustrating an example virtual warehouse, in accordance with some embodiments of the present invention.

FIG. 2 is a schematic block diagram of data that may be used to implement a public or private data exchange, in accordance with some embodiments of the present invention.

FIG. 3 is a schematic block diagram of an example deployment of a data exchange, in accordance with some embodiments of the present disclosure.

FIG. 4 is a flow diagram of an example deployment of a data exchange that illustrates techniques for determining the similarity of listings, in accordance with some embodiments of the present invention.

FIG. 5 is a block diagram of an example computing device that may perform one or more of the operations described herein, in accordance with some embodiments of the present invention.

DETAILED DESCRIPTION

Data providers often have data assets that are cumbersome to share, but of interest to another entity. For example, a large online retail company may have a data set that includes the purchasing habits of millions of consumers over the last ten years. If the online retailer wishes to share all or a portion of this data with another entity, the online retailer may need to use old and slow methods to transfer the data, such as a file-transfer-protocol (FTP), or even copying the data onto physical media and mailing the physical media to the other entity. This can have several disadvantages. First, it can be slow as copying terabytes or petabytes of data can take days. Second, once the data is delivered, the provider cannot control what happens to the data. The recipient can alter the data, make copies, or share it with other parties. Third, the only entities that would be interested in accessing such a large data set in such a manner are large corporations that can afford the complex logistics of transferring and processing the data as well as the high price of such a cumbersome data transfer. Thus, smaller entities (e.g., “mom and pop” shops) or even smaller, nimbler cloud-focused startups can be priced out of accessing this data, even though the data may be valuable to their businesses. This may be because raw data assets can be too unpolished and full of potentially sensitive data to simply outright sell/provide to other companies. Data cleaning, de-identification, aggregation, joining, and other forms of data enrichment may need to be performed by the owner of data before it is shareable with another party. This can be time-consuming and expensive. Finally, it can be difficult to share data assets with large numbers of entities because, for the reasons mentioned above, traditional data sharing methods do not allow scalable sharing. Traditional sharing methods can also introduce latency and delays in terms of all parties having access to the most recently updated data.

Private and public data exchanges may allow data providers to share their data assets more easily and securely with other entities. A public data exchange (also referred to herein as a “Snowflake data marketplace,” or a “data marketplace”) may provide a centralized repository with open access where a data provider may publish and control live and read-only data sets to thousands of consumers. A private data exchange (also referred to herein as a “data exchange”) may be under the data provider's brand, and the data provider may control who can gain access to it. The data exchange may be for internal use only, or available to consumers, partners, suppliers, or others. The data provider may control what data assets are exposed as well as control who has access to which sets of data. This can allow a seamless way to discover and share data both within a data provider's organization and with its business partners.

The data exchange may be facilitated by a cloud computing service such as the SNOWFLAKE™ cloud computing service, and allows data providers to offer data assets directly from their own online domain (e.g., website) in a private online marketplace with their own branding. The data exchange may provide a centralized, managed hub for an entity to list internally or externally-shared data assets, inspire data collaboration, and also to maintain data governance and to audit access. With the data exchange, data providers may share data without copying it between companies. Data providers may invite other entities to view their data listings, control which data listings appear in their private online marketplace, and control who can access data listings and how others can interact with the data assets connected to the listings. This may be thought of as a “walled garden” marketplace, in which visitors to the garden must be approved and access to certain listings may be limited.

As an example, Company A has collected and analyzed the consumption habits of millions of individuals in several different categories. Their data sets may include data in the following categories: online shopping, video streaming, electricity consumption, automobile usage, internet usage, clothing purchases, mobile application purchases, club memberships, and online subscription services. Company A may desire to offer these data sets (or subsets or derived products of these data sets) to other entities, thus becoming a Data Supplier or Data Provider. For example, a new clothing brand may wish to access data sets related to consumer clothing purchases and online shopping habits. Company A may support a page on its website that is or functions substantially similarly to a data exchange, where a data consumer, e.g., the new clothing brand, may browse, explore, discover, access and potentially purchase data sets directly from Company A. Further, Company A may control who can enter the data exchange, the entities that may view a particular listing, the actions that an entity may take with respect to a listing, e.g., view only, and any other suitable action. In addition, a data provider may combine its own data with other data sets from, e.g., a public data exchange (also referred to as a “data marketplace”), and create new listings using the combined data.

A data exchange may be an appropriate place to discover, assemble, clean, and enrich data to make it more monetizable. A large company on a data exchange may assemble data from across its divisions and departments, which could become valuable to another company. In addition, participants in a private ecosystem data exchange may work together to join their datasets together to jointly create a useful data product that any one of them alone would not be able to produce. Once these joined datasets are created, they may be listed on the data exchange or on the data marketplace.

Sharing data may be performed when a data provider creates a share object (hereinafter referred to as a share) of a database in the data provider's account and grants the share access to particular objects (e.g., tables, secure views, and secure user-defined functions (UDFs)) of the database. Then, a read-only database may be created using information provided in the share. Access to this database may be controlled by the data provider. A “share” encapsulates all of the information required to share data in a database. A share may include at least three pieces of information: (1) privileges that grant access to the database(s) and the schema containing the objects to share, (2) the privileges that grant access to the specific objects (e.g., tables, secure views, and secure UDFs), and (3) the consumer accounts with which the database and its objects are shared. The consumer accounts with which the database and its objects are shared may be indicated by a list of references to those consumer accounts contained within the share object. Only those consumer accounts that are specifically listed in the share object may be allowed to look up, access, and/or import from this share object. By modifying the list of references of other consumer accounts, the share object can be made accessible to more accounts or be restricted to fewer accounts.

In some embodiments, each share object contains a single role. Grants between this role and objects define what objects are being shared and with what privileges these objects are shared. The role and grants may be similar to any other role and grant system in the implementation of role-based access control. By modifying the set of grants attached to the role in a share object, more objects may be shared (by adding grants to the role), fewer objects may be shared (by revoking grants from the role), or objects may be shared with different privileges (by changing the type of grant, for example to allow write access to a shared table object that was previously read-only). In some embodiments, share objects in a provider account may be imported into the target consumer account using alias objects and cross-account role grants.
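By way of a non-limiting illustration, the share object described above may be sketched in Python as follows. The class name Share, its fields, and the privilege strings are hypothetical and chosen only for this example; they are not an interface of any particular cloud computing service.

```python
from dataclasses import dataclass, field


@dataclass
class Share:
    """Hypothetical in-memory model of a share object (illustrative only)."""
    name: str
    # Grants held by the share's single role: object name -> privileges.
    grants: dict = field(default_factory=dict)
    # References to the consumer accounts allowed to import the share.
    consumer_accounts: set = field(default_factory=set)

    def grant(self, obj, privilege):
        """Share one more object, or one more privilege on it."""
        self.grants.setdefault(obj, set()).add(privilege)

    def revoke(self, obj, privilege):
        """Stop sharing a privilege; drop the object once no grants remain."""
        privileges = self.grants.get(obj, set())
        privileges.discard(privilege)
        if not privileges:
            self.grants.pop(obj, None)

    def can_access(self, account, obj, privilege):
        """Only listed consumer accounts may use the granted privileges."""
        return (account in self.consumer_accounts
                and privilege in self.grants.get(obj, set()))


share = Share(name="sales_share")
share.grant("sales_db", "USAGE")                 # (1) database privilege
share.grant("sales_db.public", "USAGE")          # (1) schema privilege
share.grant("sales_db.public.orders", "SELECT")  # (2) object privilege
share.consumer_accounts.add("consumer_a")        # (3) consumer account
```

In this sketch, widening or narrowing a share amounts to editing either the grants or the list of consumer accounts, which mirrors the two modifications described above.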

When data is shared, no data is copied or transferred between users. Sharing is accomplished through the cloud computing services of a cloud computing service provider such as SNOWFLAKE™. Shared data may then be used to process SQL queries, possibly including joins, aggregations, or other analysis. In some instances, a data provider may define a share such that “secure joins” are permitted to be performed with respect to the shared data. A secure join may be performed such that analysis may be performed with respect to shared data, but the actual shared data is not accessible by the data consumer (e.g., recipient of the share).

A data exchange may also implement role-based access control to govern access to objects within consumer accounts using account level roles and grants. In one embodiment, account level roles are special objects in a consumer account that are assigned to users. Grants between these account level roles and database objects define what privileges the account level role has on these objects. For example, a role that has a usage grant on a database can “see” this database when executing the command “show databases”; a role that has a select grant on a table can read from this table but not write to the table. The role would need to have a modify grant on the table to be able to write to it.
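As a minimal sketch of the account level roles and grants described above, the following Python example checks the usage, select, and modify grants from the example. The grant table, role names, and helper functions are assumptions made for illustration.

```python
# Hypothetical grant table: (role, object) -> privileges held by the role.
GRANTS = {
    ("analyst", "db1"): {"usage"},
    ("analyst", "db1.t1"): {"select"},
    ("engineer", "db1"): {"usage"},
    ("engineer", "db1.t1"): {"select", "modify"},
}


def has_privilege(role, obj, privilege):
    """True if a grant between the role and the object includes the privilege."""
    return privilege in GRANTS.get((role, obj), set())


def show_databases(role, databases):
    """Return only the databases the role can 'see' (usage grant required)."""
    return [db for db in databases if has_privilege(role, db, "usage")]


def can_write(role, table):
    """Writing to a table requires a modify grant, not merely a select grant."""
    return has_privilege(role, table, "modify")
```

Here the hypothetical analyst role can list db1 and read from db1.t1 but cannot write to it, whereas the engineer role can do both.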

Searching, by users, for relevant listings in a data marketplace, e.g., Snowflake's Data Marketplace, can be a challenge. Such a search can be facilitated by providing a list (set) of listings related to a current or past search, as well as, e.g., listing recommendations for other listings that might be of interest to a user. In some cases, the contents of a listing may include native applications that have been shared by a data provider, rather than a dataset.

One approach for selecting listings to recommend involves collaborative filtering (CF) to determine those listings that are complementary to listings a user has previously interacted with. An example of CF is determining “users who performed ‘x-action’ on this listing also performed ‘x-action’ on these listings,” where “x-action” might be “viewed,” “bought,” “downloaded,” “listened-to,” etc. Another example of CF could be “these listings are popular among users like you.”
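For illustration only, the “users who viewed this listing also viewed these listings” form of CF may be sketched as a co-occurrence count. The interaction log, listing names, and function names below are hypothetical, and practical CF systems typically use more elaborate models.

```python
from collections import Counter
from itertools import permutations

# Hypothetical interaction log: user -> listings on which the user
# performed the same action (here, "viewed").
views = {
    "u1": {"weather", "retail_sales", "foot_traffic"},
    "u2": {"weather", "retail_sales"},
    "u3": {"weather", "demographics"},
}


def co_viewed(interactions):
    """Count, for each ordered pair (a, b), the users who acted on both."""
    counts = Counter()
    for listings in interactions.values():
        for a, b in permutations(sorted(listings), 2):
            counts[(a, b)] += 1
    return counts


def also_viewed(listing, interactions, k=3):
    """Listings most often viewed by the users who viewed `listing`."""
    counts = co_viewed(interactions)
    related = [(b, n) for (a, b), n in counts.items() if a == listing]
    # Sort by descending count, then by name for a stable order.
    related.sort(key=lambda item: (-item[1], item[0]))
    return [b for b, _ in related[:k]]
```

Because the ranking depends only on shared user actions, it tends to surface complementary listings rather than listings that resemble one another.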

An alternative approach to CF, suitable for recommending listings in a data marketplace, distinguishes the operations (“x actions”) performed against, e.g., a listing, from account details and characteristics that are specific to the listing. Such an approach can suggest listings that provide “replaceability” as opposed to data listings that are “complementary.” For example, a complementary search, as can be obtained from CF, might recommend light bulbs to someone who has purchased a floor lamp. By contrast, a “replaceability,” or “similarity” search might suggest other types of lamps, e.g., reading lamps, desk lamps, lanterns, flashlights, etc.

In some cases, users may not be fully satisfied with CF-style “complementary” recommendations. Thus, it can be desirable to emphasize “replaceability,” or alternatives that are similar in nature to what they have, but different in some aspects, e.g., cost, characteristics, etc., in order to allow the user to fine-tune their selection.

The present disclosure provides techniques for determining affinity of a set of listings for a first listing within a data exchange. In some embodiments, for a listing, one or more affinity scores can be determined with one or more other listings. In some cases, distance scores can be determined that show the dissimilarity between listings. In some cases, these scores can be derived from SQL metadata stored in database tables and columns and from database and external functions. In some embodiments, recommendations of similar listings can be based on popularity, relevance, or on personalized recommendations based on user preferences and/or activity. In some embodiments, upon determination of a set of listings, that set, or list, can be presented to a consumer.
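As one hedged sketch of how pairwise affinity and distance scores might be computed and stored, the following Python example scores each pair of listings by the Jaccard similarity of their characteristic sets. The choice of Jaccard similarity, the listing names, the characteristics, and the dictionary used as an affinity store are all assumptions for illustration; the disclosure does not prescribe a particular metric.

```python
from itertools import combinations

# Hypothetical listings, each described by a set of characteristics.
listings = {
    "us_weather": {"weather", "daily", "united_states"},
    "eu_weather": {"weather", "daily", "europe"},
    "us_retail": {"retail", "weekly", "united_states"},
}


def affinity(a, b):
    """Jaccard similarity of two characteristic sets (an assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def distance(a, b):
    """Dissimilarity between two listings, defined here as 1 - affinity."""
    return 1.0 - affinity(a, b)


def build_affinity_store(items):
    """Score every unordered pair once; each key is a sorted pair of names."""
    return {
        tuple(sorted((x, y))): affinity(items[x], items[y])
        for x, y in combinations(items, 2)
    }


def recommend(first, store, k=2):
    """Return up to k (listing, score) pairs with the highest affinity."""
    scored = []
    for (x, y), score in store.items():
        if first in (x, y):
            scored.append((y if x == first else x, score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


store = build_affinity_store(listings)
```

Precomputing the store keeps presentation cheap: recommending listings similar to a first listing becomes a lookup and sort over stored scores rather than a fresh comparison.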

FIG. 1A is a block diagram of an example computing environment 100 in which the systems and methods disclosed herein may be implemented. In particular, a cloud computing platform 110 may be implemented, such as Amazon Web Services™ (AWS), Microsoft Azure™, Google Cloud™, or the like. As known in the art, a cloud computing platform 110 provides computing resources and storage resources that may be acquired (purchased) or leased and configured to execute applications and store data.

The cloud computing platform 110 may host a cloud computing service 112 that facilitates storage of data on the cloud computing platform 110 (e.g., data management and access) and analysis functions (e.g., SQL queries, analysis), as well as other computation capabilities (e.g., secure data sharing between users of the cloud computing platform 110). The cloud computing platform 110 may include a three-tier architecture: data storage 140, query processing 130, and cloud services 120.

Data storage 140 may facilitate the storing of data on the cloud computing platform 110 in one or more cloud databases 141. Data storage 140 may use a storage service such as Amazon S3™ to store data and query results on the cloud computing platform 110. In particular embodiments, to load data into the cloud computing platform 110, data tables may be horizontally partitioned into large, immutable files that may be analogous to blocks or pages in a traditional database system. Within each file, the values of each attribute or column are grouped together and compressed using a scheme sometimes referred to as hybrid columnar. Each table has a header which, among other metadata, contains the offsets of each column within the file.
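The layout described above, in which rows are partitioned horizontally into files and the values of each column are grouped and compressed within a file, may be illustrated by the following simplified Python sketch. The byte format (a length-prefixed JSON header followed by zlib-compressed columns) is an assumption invented for this example and is not the storage format of any actual service.

```python
import json
import struct
import zlib


def partition(rows, rows_per_file):
    """Horizontally partition a table into fixed-size groups of rows."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def write_file(rows, columns):
    """Serialize one partition: header with column offsets, then columns."""
    blobs = []
    for col in columns:
        values = [row[col] for row in rows]  # group the values by column
        blobs.append(zlib.compress(json.dumps(values).encode("utf-8")))
    offsets, position = {}, 0
    for col, blob in zip(columns, blobs):
        offsets[col] = [position, len(blob)]  # offset relative to the body
        position += len(blob)
    header = json.dumps({"rows": len(rows), "offsets": offsets}).encode("utf-8")
    # Layout: 4-byte big-endian header length, header, column blobs.
    return struct.pack(">I", len(header)) + header + b"".join(blobs)


def read_column(data, column):
    """Read a single column without decompressing the other columns."""
    (header_len,) = struct.unpack(">I", data[:4])
    header = json.loads(data[4:4 + header_len])
    start, length = header["offsets"][column]
    body = data[4 + header_len:]
    return json.loads(zlib.decompress(body[start:start + length]))
```

The header offsets are what allow a query that touches one column to skip the bytes of every other column in the file.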

In addition to storing table data, data storage 140 facilitates the storage of temp data generated by query operations (e.g., joins), as well as the data contained in large query results. This may allow the system to compute large queries without out-of-memory or out-of-disk errors. Storing query results this way may simplify query processing as it removes the need for server-side cursors found in traditional database systems.

Query processing 130 may handle query execution within elastic clusters of virtual machines, referred to herein as virtual warehouses or data warehouses. Thus, query processing 130 may include one or more virtual warehouses 131, which may also be referred to herein as data warehouses. The virtual warehouses 131 may be one or more virtual machines operating on the cloud computing platform 110. The virtual warehouses 131 may be compute resources that may be created, destroyed, or resized at any point, on demand. This functionality may create an “elastic” virtual warehouse that expands, contracts, or shuts down according to the user's needs. Expanding a virtual warehouse involves adding one or more compute nodes 132 to a virtual warehouse 131. Contracting a virtual warehouse involves removing one or more compute nodes 132 from a virtual warehouse 131. More compute nodes 132 may lead to faster compute times. For example, generation of affinity information that takes fifteen hours on a system with four nodes might take only two hours with thirty-two nodes.
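The figures in the example above can be checked with a short calculation. The symbols T, S, and E (run time, speedup, and parallel efficiency) are introduced here only for illustration.

```latex
\[
S = \frac{T_{4}}{T_{32}} = \frac{15\ \text{h}}{2\ \text{h}} = 7.5,
\qquad
E = \frac{S}{32/4} = \frac{7.5}{8} \approx 0.94
\]
```

Under these example figures, an eightfold increase in compute nodes yields a 7.5-fold reduction in run time, which is close to, but slightly below, ideal linear scaling (15 h / 8 ≈ 1.9 h).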

Cloud services 120 may be a collection of services that coordinate activities across the cloud computing service 112. These services tie together all of the different components of the cloud computing service 112 in order to process user requests, from login to query dispatch. Cloud services 120 may operate on compute instances provisioned by the cloud computing service 112 from the cloud computing platform 110. Cloud services 120 may include a collection of services that manage virtual warehouses, queries, transactions, data exchanges, and the metadata associated with such services, such as database schemas, access control information, encryption keys, and usage statistics. Cloud services 120 may include, but not be limited to, authentication engine 121, infrastructure manager 122, optimizer 123, exchange manager 124, security engine 125, and metadata storage 126.

FIG. 1B is a block diagram illustrating an example virtual warehouse 131. The exchange manager 124 may facilitate the sharing of data between data providers and data consumers, using, for example, a data exchange. For example, cloud computing service 112 may manage the storage and access of a database 108. The database 108 may include various instances of user data 150 for different users, e.g., different enterprises or individuals. The user data 150 may include a user database 152 of data stored and accessed by that user. The user database 152 may be subject to access controls such that only the owner of the data is allowed to change and access the user database 152 upon authenticating with the cloud computing service 112. For example, data may be encrypted such that it can only be decrypted using decryption information possessed by the owner of the data. Using the exchange manager 124, specific data from a user database 152 that is subject to these access controls may be shared with other users in a controlled manner. In particular, a user may specify shares 154 that may be shared in a public or private data exchange in an uncontrolled manner or shared with specific other users in a controlled manner as described above. A “share” encapsulates all of the information required to share data in a database. A share may include at least three pieces of information: (1) privileges that grant access to the database(s) and the schema containing the objects to share, (2) the privileges that grant access to the specific objects (e.g., tables, secure views, and secure UDFs), and (3) the consumer accounts with which the database and its objects are shared. When data is shared, no data is copied or transferred between users. Sharing is accomplished through the cloud services 120 of cloud computing service 112.

Sharing data may be performed when a data provider creates a share of a database in the data provider's account and grants access to particular objects (e.g., tables, secure views, and secure user-defined functions (UDFs)). Then a read-only database may be created using information provided in the share. Access to this database may be controlled by the data provider.

Shared data may then be used to process SQL queries, possibly including joins, aggregations, or other analysis. In some instances, a data provider may define a share such that “secure joins” are permitted to be performed with respect to the shared data. A secure join may be performed such that analysis may be performed with respect to shared data while the actual shared data is not accessible by the data consumer (e.g., recipient of the share). A secure join may be performed as described in U.S. application Ser. No. 16/368,339, filed Mar. 18, 2019.

User devices 101-104, such as laptop computers, desktop computers, mobile phones, tablet computers, cloud-hosted computers, cloud-hosted serverless processes, or other computing processes or devices may be used to access the virtual warehouse 131 or cloud services 120 by way of a network 105, such as the Internet or a private network.

In the description below, actions are ascribed to users, particularly consumers and providers. Such actions shall be understood to be performed with respect to devices 101-104 operated by such users. For example, notification to a user may be understood to be a notification transmitted to devices 101-104, an input or instruction from a user may be understood to be received by way of the user's devices 101-104, and interaction with an interface by a user shall be understood to be interaction with the interface on the user's devices 101-104. In addition, database operations (joining, aggregating, analysis, etc.) ascribed to a user (consumer or provider) shall be understood to include performing of such actions by the cloud computing service 112 in response to an instruction from that user.

FIG. 2 is a schematic block diagram of data that may be used to implement a public or private data exchange in accordance with an embodiment of the present invention. The exchange manager 124 may operate with respect to some or all of the illustrated exchange data 200, which may be stored on the platform executing the exchange manager 124 (e.g., the cloud computing platform 110) or at some other location. The exchange data 200 may include a plurality of listings 202 describing data that is shared by a first user (“the provider”). The listings 202 may be listings in a data exchange or in a data marketplace. The access controls, management, and governance of the listings may be similar for both a data marketplace and a data exchange.

The listing 202 may include access controls 206, which may be configurable to any suitable access configuration. For example, access controls 206 may indicate that the shared data is available to any member of the private exchange without restriction (an “any share” as used elsewhere herein). The access controls 206 may specify a class of users (members of a particular group or organization) that are allowed to access the data and/or see the listing. The access controls 206 may specify a “point-to-point” share in which users may request access but are only allowed access upon approval of the provider. The access controls 206 may specify a set of user identifiers of users that are excluded from being able to access the data referenced by the listing 202.

Note that some listings 202 may be discoverable by users without further authentication or access permissions whereas actual accesses are only permitted after a subsequent authentication step (see discussion of FIGS. 4 and 6). The access controls 206 may specify that a listing 202 is only discoverable by specific users or classes of users.

Note also that a default function for listings 202 is that the data referenced by the share is exportable by the consumer. Alternatively, the access controls 206 may specify that this is not permitted. For example, access controls 206 may specify that secure operations (secure joins and secure functions as discussed below) may be performed with respect to the shared data such that viewing and exporting of the shared data is not permitted.

In some embodiments, once a user is authenticated with respect to a listing 202, a reference to that user (e.g., user identifier of the user's account with the virtual warehouse 131) is added to the access controls 206 such that the user will subsequently be able to access the data referenced by the listing 202 without further authentication.
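The access configurations described above (an “any share,” a class of users, a “point-to-point” share, and a set of excluded users) may be sketched as follows. The class AccessControls, its field names, and the mode strings are hypothetical and serve only to illustrate one possible evaluation order.

```python
from dataclasses import dataclass, field


@dataclass
class AccessControls:
    """Hypothetical model of access controls 206; names are illustrative."""
    mode: str = "any"  # "any", "class", or "point_to_point"
    allowed_groups: set = field(default_factory=set)
    excluded_users: set = field(default_factory=set)
    approved_users: set = field(default_factory=set)

    def may_access(self, user_id, groups=frozenset()):
        if user_id in self.excluded_users:
            return False  # exclusion takes precedence in this sketch
        if user_id in self.approved_users:
            return True   # previously authenticated or approved
        if self.mode == "any":
            return True
        if self.mode == "class":
            return bool(self.allowed_groups & set(groups))
        return False      # point-to-point: provider approval required

    def approve(self, user_id):
        """Record the user so later accesses need no further authentication."""
        self.approved_users.add(user_id)
```

Whether an exclusion should override an earlier approval is a design choice; this sketch assumes that it does.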

The listing 202 may define one or more filters 208. For example, the filters 208 may define specific identity data 214 (also referred to herein as user identifiers) of users that may view references to the listing 202 when browsing the catalog 220. The filters 208 may define a class of users (users of a certain profession, users associated with a particular company or organization, users within a particular geographical area or country) that may view references to the listing 202 when browsing the catalog 220. In this manner, a private exchange may be implemented by the exchange manager 124 using the same components. In some embodiments, an excluded user that is excluded from accessing a listing 202, i.e., adding the listing 202 to the consumed shares 156 of the excluded user, may still be permitted to view a representation of the listing when browsing the catalog 220 and may further be permitted to request access to the listing 202 as discussed below. Requests to access a listing by such excluded users and other users may be listed in an interface presented to the provider of the listing 202. The provider of the listing 202 may then view demand for access to the listing and choose to expand the filters 208 to permit access to excluded users or classes of excluded users (e.g., users in excluded geographic regions or countries).

Filters 208 may further define what data may be viewed by a user. In particular, filters 208 may indicate that a user that selects a listing 202 to add to the consumed shares 156 of the user is permitted to access the data referenced by the listing but only a filtered version that only includes data associated with the identifier 214 of that user, associated with that user's organization, or specific to some other classification of the user. In some embodiments, a private exchange is by invitation: users invited by a provider to view listings 202 of a private exchange are enabled to do so by the exchange manager 124 upon communicating acceptance of an invitation received from the provider.
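The two roles of filters 208 described above, controlling who may view a listing in the catalog and controlling which data a user may view, may be illustrated with the following sketch. The dictionary keys (user_ids, orgs, countries, user_id, org) and both function names are assumptions made for this example.

```python
def can_view_listing(listing_filter, user):
    """A listing is visible if the user matches an allowed identifier,
    organization, or country in the (hypothetical) filter definition."""
    return (
        user["id"] in listing_filter.get("user_ids", set())
        or user["org"] in listing_filter.get("orgs", set())
        or user["country"] in listing_filter.get("countries", set())
    )


def filtered_view(rows, user):
    """Return only the rows associated with the user's identifier or with
    the user's organization."""
    return [
        row for row in rows
        if row.get("user_id") == user["id"] or row.get("org") == user["org"]
    ]
```

Expanding a filter to admit previously excluded users corresponds, in this sketch, to adding entries to one of the sets in the filter definition.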

In some embodiments, a listing 202 may be addressed to a single user. Accordingly, a reference to the listing 202 may be added to a set of “pending shares” that is viewable by the user. The listing 202 may then be added to a group of shares of the user upon the user communicating approval to the exchange manager 124.

The listing 202 may further include usage data 210. For example, the cloud computing service 112 may implement a credit system in which credits are purchased by a user and are consumed each time a user runs a query, stores data, or uses other services implemented by the cloud computing service 112. Accordingly, usage data 210 may record an amount of credits consumed by accessing the shared data. Usage data 210 may include other data such as a number of queries, a number of aggregations of each type of a plurality of types performed against the shared data, or other usage statistics. In some embodiments, usage data for a listing 202 or multiple listings 202 of a user is provided to the user in the form of a shared database, i.e., a reference to a database including the usage data is added by the exchange manager 124 to the consumed shares 156 of the user.
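As a minimal sketch, the usage statistics described above may be accumulated per listing as follows. The class UsageData, its field names, and the credit amounts are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UsageData:
    """Hypothetical per-listing usage record (illustrative field names)."""
    credits_consumed: float = 0.0
    query_count: int = 0
    # Number of aggregations of each type performed against the shared data.
    aggregations: Counter = field(default_factory=Counter)

    def record_query(self, credits, aggregation_types=()):
        """Record one query and the credits it consumed."""
        self.credits_consumed += credits
        self.query_count += 1
        self.aggregations.update(aggregation_types)


usage = UsageData()
usage.record_query(0.5, ["sum", "avg"])
usage.record_query(0.25, ["sum"])
```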

The listing 202 may also include a heat map 211, which may represent the geographical locations in which users have clicked on that particular listing. The cloud computing service 112 may use the heat map to make replication decisions or other decisions with the listing. For example, a data exchange may display a listing that contains weather data for Georgia, USA. The heat map 211 may indicate that many users in California are selecting the listing to learn more about the weather in Georgia. In view of this information, the cloud computing service 112 may replicate the listing and make it available in a database whose servers are physically located in the western United States, so that consumers in California may have access to the data. In some embodiments, an entity may store its data on servers located in the western United States. A particular listing may be very popular with consumers. The cloud computing service 112 may replicate that data and store it in servers located in the eastern United States, so that consumers in the Midwest and on the East Coast may also have access to that data.
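One simple way a heat map could drive a replication decision is a per-region click threshold, as sketched below. The region names, the threshold rule, and the function name are assumptions; an actual service may weigh other factors such as cost or latency.

```python
from collections import Counter


def regions_to_replicate(clicks, current_regions, threshold):
    """Return regions whose click count meets the threshold but that do not
    yet hold a replica of the listing."""
    heat_map = Counter(clicks)  # region -> number of clicks on the listing
    return sorted(
        region for region, count in heat_map.items()
        if count >= threshold and region not in current_regions
    )


# Hypothetical click log for a listing currently stored only in us-east.
clicks = ["us-west"] * 5 + ["us-east"] * 2 + ["eu-central"]
candidates = regions_to_replicate(clicks, {"us-east"}, threshold=3)
```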

The listing 202 may also include one or more tags 213. The tags 213 may facilitate simpler sharing of data contained in one or more listings. As an example, a large company may have a human resources (HR) listing containing HR data for its internal employees on a data exchange. The HR data may contain ten types of HR data (e.g., employee number, selected health insurance, current retirement plan, job title, etc.). The HR listing may be accessible to 100 people in the company (e.g., everyone in the HR department). Management of the HR department may wish to add an eleventh type of HR data (e.g., an employee stock option plan). Instead of manually adding this to the HR listing and granting each of the 100 people access to this new data, management may simply apply an HR tag to the new data set and that can be used to categorize the data as HR data, list it along with the HR listing, and grant access to the 100 people to view the new data set.
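The HR example above may be sketched by resolving access through tags rather than through per-user grants. The data set names, user names, and the function accessible_datasets are hypothetical.

```python
def accessible_datasets(user, datasets, tag_access):
    """Data sets a user can view: those carrying any tag the user may access."""
    user_tags = {tag for tag, users in tag_access.items() if user in users}
    return sorted(name for name, tags in datasets.items() if tags & user_tags)


# Hypothetical state: access is granted once per tag, not once per data set.
tag_access = {"HR": {"alice", "bob"}}
datasets = {"job_titles": {"HR"}, "sales_pipeline": {"finance"}}

# Adding an eleventh type of HR data only requires applying the HR tag.
datasets["stock_option_plan"] = {"HR"}
```

No per-user grant is issued for the new data set; everyone with access to the HR tag can view it as soon as the tag is applied.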

The listing 202 may also include version metadata 215. Version metadata 215 may provide a way to track how the listings have changed over time. This may assist in ensuring that the data that is being viewed by one entity is not changed prematurely. For example, if a company has an original data set and then releases an updated version of that data set, the updates could interfere with another user's processing of that data set, because the update could have different formatting, new columns, and other changes that may be incompatible with the current processing mechanism of the recipient user. To remedy this, the cloud computing service 112 may track version updates using version metadata 215. The cloud computing service 112 may ensure that each data consumer accesses the same version of the data until they accept an updated version that will not interfere with current processing of the data set.
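A minimal sketch of version pinning, under the assumption that each consumer stays on the version first read until explicitly accepting a newer one, is shown below. The class VersionedListing and its methods are hypothetical.

```python
class VersionedListing:
    """Hypothetical listing that keeps every published version of its data."""

    def __init__(self, initial_data):
        self.versions = [initial_data]  # version n is stored at index n - 1
        self.pinned = {}                # consumer -> pinned version number

    def publish(self, data):
        """Release an updated version; existing consumers stay pinned."""
        self.versions.append(data)
        return len(self.versions)

    def read(self, consumer):
        """Serve the consumer's pinned version (latest on first access)."""
        version = self.pinned.setdefault(consumer, len(self.versions))
        return self.versions[version - 1]

    def accept_latest(self, consumer):
        """Move the consumer to the newest version once they opt in."""
        self.pinned[consumer] = len(self.versions)
```

In this sketch a consumer whose processing depends on the original columns continues to receive them after an update is published, and switches only after calling accept_latest.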

The exchange data 200 may further include user records 212. The user record 212 may include data identifying the user associated with the user record 212, e.g., an identifier (e.g., warehouse identifier) of a user having user data 151 in service database 158 and managed by the virtual warehouse 131.

The user record 212 may list shares associated with the user, e.g., reference listings 154 created by the user. The user record 212 may list shares consumed by the user, e.g., reference listings 202 created by another user and that have been associated to the account of the user according to the methods described herein. For example, a listing 202 may have an identifier that will be used to reference it in the shares or consumed shares 156 of a user record 212.

The listing 202 may also include metadata 204 describing the shared data. The metadata 204 may include some or all of the following information: an identifier of the provider of the shared data, a URL associated with the provider, a name of the share, names of the tables, a category to which the shared data belongs, an update frequency of the shared data, a catalog of the tables, a number of columns and a number of rows in each table, as well as names for the columns. The metadata 204 may also include examples to aid a user in using the data. Such examples may include sample tables that include a sample of rows and columns of an example table, example queries that may be run against the tables, example views of an example table, and example visualizations (e.g., graphs, dashboards) based on a table's data. Other information included in the metadata 204 may be metadata for use by business intelligence tools, a text description of data contained in the table, keywords associated with the table to facilitate searching, a link (e.g., URL) to documentation related to the shared data, and a refresh interval indicating how frequently the shared data is updated along with the date the data was last updated.

The metadata 204 may further include category information indicating a type of the data/service (e.g., location, weather), industry information indicating who uses the data/service (e.g., retail, life sciences), and use case information that indicates how the data/service is used (e.g., supply chain optimization, or risk analysis). For instance, retail consumers may use weather data for supply chain optimization. A use case may refer to a problem that a consumer is solving (i.e., an objective of the consumer) such as supply chain optimization. A use case may be specific to a particular industry, or can apply to multiple industries. Any given data listing can help solve one or more use cases, and hence may be applicable to multiple use cases.

Because use case information relates to how data is used, it can be a powerful tool for organizing/searching for data listings as it allows consumers of the data marketplace to explore and find listings and services based on industry problems they're trying to solve (e.g., supply chain optimization, audience segmentation). However, while users can often find complementary listings, they can find it difficult to find listings that suggest “replaceability” as opposed to “complementarity.” For example, a complementary search might recommend light bulbs to someone who has purchased a floor lamp. By contrast, a “replaceability,” or “similarity,” search might suggest different kinds of lamps, e.g., reading lamps, desk lamps, lanterns, flashlights, etc.

Embodiments of the present disclosure solve the above and other problems by enabling providers to assign “similarity” or “affinity” scores to pairs of listings, allowing identification of a set of listings having characteristics most like a particular listing. The embodiments described herein make it easy for consumers to browse the data exchange based on their business needs in order to find listings that solve those needs. Embodiments of the present disclosure also enable a data exchange to better serve its consumers' business needs based on their browsing patterns, querying activities (individual and collective), and the intrinsic content of listings, and further personalize their overall data exchange experience.

The exchange data 200 may further include a catalog 220. The catalog 220 may include a listing of all available listings 202 and may include an index of data from the metadata 204 to facilitate browsing and searching according to the methods described herein. In some embodiments, listings 202 are stored in the catalog in the form of JavaScript Object Notation (JSON) objects.

Note that where there are multiple instances of the virtual warehouse 131 on different cloud computing platforms, the catalog 220 of one instance of the virtual warehouse 131 may store listings or references to listings from other instances on one or more other cloud computing platforms 110. Accordingly, each listing 202 may be globally unique (e.g., be assigned a globally unique identifier across all of the instances of the virtual warehouse 131). For example, the instances of the virtual warehouses 131 may synchronize their copies of the catalog 220 such that each copy indicates the listings 202 available from all instances of the virtual warehouse 131. In some instances, a provider of a listing 202 may specify that it is to be available on only one or more specified computing platforms 110.

In some embodiments, the catalog 220 is made available on the Internet such that it is searchable by a search engine such as the Bing™ search engine or the Google search engine. The catalog may be subject to a search engine optimization (SEO) algorithm to promote its visibility. Potential consumers may therefore browse the catalog 220 from any web browser. The exchange manager 124 may expose uniform resource locators (URLs) linked to each listing 202. This URL may be searchable and can be shared outside of any interface implemented by the exchange manager 124. For example, the provider of a listing 202 may publish the URLs for its listings 202 in order to promote usage of its listing 202 and its brand.

FIG. 3 illustrates a cloud environment 300 that includes a storage platform 310 (similar to the cloud computing platform 110 illustrated in FIG. 1A) and a cloud deployment 305. The cloud deployment 305 may comprise a similar architecture to cloud computing service 112 (illustrated in FIG. 1A) and may be a deployment of a data exchange or data marketplace. Although illustrated with a single cloud deployment, the cloud environment 300 may have multiple cloud deployments that may be physically located in separate remote geographical regions, but may all be deployments of a single data exchange or data marketplace. Although embodiments of the present disclosure are described with respect to a data exchange, this is for example purposes only and the embodiments of the present disclosure may be implemented in any appropriate enterprise database system or data sharing platform where data may be shared among users of the system/platform.

The cloud deployment 305 may include hardware such as processing device 305A (e.g., processors, central processing units (CPUs)), memory 305B (e.g., random access memory (RAM)), storage devices (e.g., hard-disk drive (HDD), solid-state drive (SSD), etc.), and other hardware devices (e.g., sound card, video card, etc.). A storage device may comprise persistent storage capable of storing data. Persistent storage may be a local storage unit or a remote storage unit. Persistent storage may be a magnetic storage unit, optical storage unit, solid state storage unit, electronic storage unit (main memory), or similar storage unit. Persistent storage may also be a monolithic/single device or a distributed set of devices. The cloud deployment 305 may comprise any suitable type of computing device or machine that has a programmable processor including, for example, server computers, desktop computers, laptop computers, tablet computers, smartphones, set-top boxes, etc. In some examples, the cloud deployment 305 may comprise a single machine or may include multiple interconnected machines (e.g., multiple servers configured in a cluster).

Databases and schemas may be used to organize data stored in the cloud deployment 305 and each database may belong to a single account within the cloud deployment 305. Each database may be thought of as a container having a classic folder hierarchy within it. Each database may be a logical grouping of schemas and a schema may be a logical grouping of database objects, e.g., tables, views, etc. Each schema may belong to a single database. Together, a database and a schema may comprise a namespace. When performing any operations on objects within a database, the namespace can be inferred from the current database and the schema that is in use for the session. If a database and schema are not in use for the session, the namespace may need to be explicitly specified when performing any operations on the objects.

The storage platform 310 may facilitate the storing of data and may comprise any appropriate object storage service, such as the Amazon S3™ service, to store data and query results. The storage platform 310 may comprise multiple buckets (databases) 311A-311C.

FIG. 3 also illustrates an example listing similarity determination process in which the affinity between listings can be determined and made available to cloud deployment 305.

Identifying “similar listings” for a user can involve two steps. First, similarity factors can be identified and determined for pairs of listings. Second, an “affinity” score can be recorded for each pair of listings. An affinity score is sometimes referred to as a similarity score. Then, for a particular listing, other listings having a high affinity, or similarity, to that listing can be presented to the user.

In some embodiments, the cloud deployment 305 may contain an affinity store 315. Affinity store 315 may be a table containing similarity tuples 330. In an example, a similarity tuple 330, or similarity record, may contain listings, e.g., 320A and 320B, and attributes 335 that describe a measure of similarity between the listings 320A and 320B. In some embodiments, these attributes 335 may include similarity information that describes the similarity or affinity between the listings 320A and 320B. In some cases, a similarity tuple 330 may include a pair of listings, 320A and 320B, as shown. In some cases, a similarity tuple 330 may include any number of listings. In some embodiments, a similarity record may reference a set of listings. In some cases, within the similarity tuple 330, listings 320A and 320B may be examples of listing 202, as illustrated in FIG. 2.
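As a rough illustration of the affinity store described above, the following Python sketch models a similarity record as a set of listing identifiers plus a dictionary of attributes. The class names, the in-memory dictionary, and the identifier strings are assumptions made for illustration only; the disclosure describes the store as a table and does not prescribe a layout.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class SimilarityRecord:
    """Hypothetical similarity tuple: a set of listings plus descriptive attributes."""
    listings: FrozenSet[str]
    attributes: Dict[str, float] = field(default_factory=dict)


class AffinityStore:
    """In-memory stand-in for the affinity table, keyed by an unordered listing pair."""

    def __init__(self) -> None:
        self._records: Dict[FrozenSet[str], SimilarityRecord] = {}

    def put(self, a: str, b: str, **attributes: float) -> None:
        # A frozenset key makes (a, b) and (b, a) refer to the same record.
        key = frozenset((a, b))
        self._records[key] = SimilarityRecord(key, dict(attributes))

    def get(self, a: str, b: str) -> SimilarityRecord:
        return self._records[frozenset((a, b))]


store = AffinityStore()
store.put("listing_320A", "listing_320B", affinity=0.82)
record = store.get("listing_320B", "listing_320A")  # lookup order does not matter
```

Keying on an unordered pair is one possible design choice; it avoids storing the same score twice when the affinity measure is symmetric.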

In some cases, an affinity score can be determined for each pair of listings. In some embodiments, this affinity score may be stored as an attribute 335. There are a number of ways that affinity scores can be calculated, with varying degrees of quality and performance. In some embodiments, a distance score can be determined, showing the dissimilarity of a pair of listings. In some cases, listing metadata (predetermined static values that describe the listing, e.g., listing name, description text, and relevant industry vertical) can be used. In some cases, information from the actual dataset content can be used. In some cases, this information can include information derived from SQL metadata and information about the actual values, e.g., data distributions, numbers of distinct values, etc. In some cases, the information derived from SQL metadata can include the names of database tables and the number and names of the columns in those database tables. In some cases, the information derived from SQL metadata can include knowledge gained from machine learning classifiers. Examples of such information can include the semantic type of each column (e.g., a column can be named “Location,” be of type “string” or “integer,” and have a semantic type of “Zip Code”). For such a column, an embodiment could score the similarity of a pair of listings based on their zip codes. For example, a user may be interested in listings for a similar geographic area. In some embodiments, the results of these similarity functions may be recorded in the attributes 335.
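One simple way to turn such metadata into a score is set overlap. The Python sketch below uses Jaccard similarity over column names and over semantic types and averages the two; the choice of Jaccard similarity, the equal weights, the dictionary keys, and the sample listings are all assumptions for illustration and are not taken from the disclosure.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def metadata_affinity(x: Dict[str, Set[str]], y: Dict[str, Set[str]]) -> float:
    """Average the overlap of column names and of semantic types (assumed equal weights)."""
    column_overlap = jaccard(x["columns"], y["columns"])
    type_overlap = jaccard(x["semantic_types"], y["semantic_types"])
    return 0.5 * column_overlap + 0.5 * type_overlap


# Hypothetical listings described only by SQL-derived metadata.
weather = {"columns": {"zip", "temp", "date"}, "semantic_types": {"Zip Code", "Date"}}
retail = {"columns": {"zip", "sales", "date"}, "semantic_types": {"Zip Code", "Date", "Currency"}}

affinity = metadata_affinity(weather, retail)
distance = 1.0 - affinity  # the corresponding dissimilarity (distance) score
```

Both the affinity and the derived distance could be recorded as attributes of the similarity record for the pair.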

In some cases, affinity scores could be based on the number of views that a listing received, on the number of installations of a listing, on the number of times that native applications within listings are executed, or on other measures of usage. In some embodiments, multiple affinity scores may be stored as attributes 335.

Listing metadata can include a structured field “category,” which can be used to determine the similarity of listings. In some cases, natural language processing (NLP) can extract features from the description text of a listing. These features can refer to the content of the listing and to the quality of the listing. In some embodiments, this information can be recorded in attributes 335.

In some cases, attributes 335 may include timestamps as to when the affinity calculation was performed, the criteria by which the affinity was determined, an expiration time for the record 330, and a metric indicating the number of times the record was used. In some embodiments, timestamps may be included for purposes of time decay operations. In some cases, attributes may be incorporated to support weighting of similarity records 330. In some cases, attributes may be incorporated to support tiered formulas, such that listings by Company A that possess high affinity with certain other listings are not adversely affected by the publication of new listings by Company A, which, due to their novelty, have low affinity scores.

In some cases, determining an affinity score for pairs of listings can be performed offline, in advance of use. In other cases, the affinity score can be obtained on-the-fly. For an on-the-fly identification of similar listings, a given listing can be compared to all other listings and those most similar identified. If distance functions are being used rather than similarity functions, the least dissimilar listings can be identified instead.
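The on-the-fly case can be sketched as a scan over all other listings that keeps only the k best scores. In the Python sketch below, the tag sets, the Jaccard scoring function, and the listing names are hypothetical; any similarity function could be substituted.

```python
import heapq
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(target: str, listings: Dict[str, Set[str]], k: int) -> List[Tuple[float, str]]:
    """Compare the target against every other listing and keep the k highest scores."""
    scored = (
        (jaccard(listings[target], tags), name)
        for name, tags in listings.items()
        if name != target
    )
    return heapq.nlargest(k, scored)


# Hypothetical listings, each described by a set of category tags.
listings = {
    "A": {"weather", "location"},
    "B": {"weather", "location", "retail"},
    "C": {"finance"},
    "D": {"weather"},
}
top = most_similar("A", listings, k=2)
```

With a distance function in place of a similarity function, `heapq.nsmallest` would select the least dissimilar listings in the same way.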

In some embodiments, a decision to pre-compute and store, for each listing, its most similar counterpart listings may depend on several factors, including system limitations and usage patterns. In some cases, these same factors may instead favor on-the-fly determination of the most similar listings.

After a set of similar listings has been identified, the set can be presented to a user. The presentation of most-similar listings can be deterministic/strict, such that the order of similar listings is strictly based on the affinity scores. In some cases, the presentation can contain randomization, e.g., mix the order and/or select N listings among the N+M most similar. In some embodiments, randomizations can be inserted as a means to avoid situations such as promoting the top “most similar” and demoting the “runners-up.”
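The "select N among the N+M most similar" variant can be sketched as follows. This Python example assumes the listings are already ranked by affinity score; the function name, the seeded random generator, and the decision to keep the chosen listings in rank order are illustrative assumptions.

```python
import random
from typing import List, Sequence


def pick_similar(ranked: Sequence[str], n: int, m: int, rng: random.Random) -> List[str]:
    """Pick n listings at random from the n + m most similar, preserving rank order."""
    pool = list(ranked[: n + m])
    count = min(n, len(pool))
    # Sample positions rather than items so the chosen listings stay in rank order.
    positions = sorted(rng.sample(range(len(pool)), count))
    return [pool[i] for i in positions]


ranked = ["L1", "L2", "L3", "L4", "L5", "L6"]  # already sorted by affinity score
shown = pick_similar(ranked, n=3, m=2, rng=random.Random(7))
strict = pick_similar(ranked, n=3, m=0, rng=random.Random(7))  # m = 0 gives strict order
```

Setting m to zero recovers the deterministic presentation, so a single code path can serve both modes.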

In addition to suggesting listings that are relevant, or similar, to other listings, consumers can also be provided with a means of obtaining recommendations for listings that reflect their personalized criteria and the popularity of the listings.

In some cases, criteria for providing personalized recommendations can include information from the consumer's account, e.g., industry, company, role/job title, role/privileges within the data exchange. In some embodiments, personalized recommendations can be influenced by the consumer's exchange activity, e.g., previous searches and browsing activity, and previous listing views, installs, or jobs. In some cases, personalized recommendations can be influenced by a consumer's existing data, e.g., data in the exchange or queries. In some cases, personalized recommendations may be influenced by listing quality, e.g., listing completeness, provider reputation, etc.

Listing recommendations may be made based on (or additionally based on) popularity of listings. In an embodiment, popularity can be viewed as a vector measure rather than a single value, and may be based on listing views, installs/uninstalls for trials and purchases, and jobs (native applications) run. In other words, a popularity metric may have multiple components. In an embodiment, popularity may be decomposed into sub-values for the number of times a listing has been viewed, the number of times a listing has been installed or retrieved (get request), and the number of times a job (native application) has been executed. In some cases, popularity may be based on the surface area on web pages. Popularity may also be determined by computing weighted popularity scores for each listing, then filtering by combinations of categories, business needs, and providers.

Given the determination of the popularity of a listing, the popularity of the providers of those listings can also be established. In some cases, considering a provider's listings, scoring functions such as average popularity score, average percentile, average over each provider's top-k, weighted averages based on quartile/global-top-k (e.g., 1.0 weight if in top quartile, 0.5 if in second, etc.), or combinations of the above may be used. In some embodiments, PageRank or Hubs and Authorities-like approaches may be used. PageRank works by counting the number and quality of links to a webpage to determine a rough estimate of how important the webpage is. An underlying assumption is that more important web pages are likely to receive more links from other web pages. A PageRank algorithm scores the popularity of a webpage A as the sum of the popularity scores of pages having outgoing links to webpage A divided by the number of outgoing links of each page. By comparison, a Hubs and Authorities algorithm (also known as a hyperlink-induced topic search, or HITS algorithm) assigns two scores for each web page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the values of its links to other web pages.
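The PageRank scoring described above can be sketched with a short power iteration. The Python example below is a minimal sketch: the damping factor of 0.85, the fixed iteration count, the handling of nodes without outgoing links, and the three-node graph are assumptions for illustration, and how links between providers or listings would be derived in a data exchange is not specified here.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Score each node from the scores of the nodes linking to it, split by out-degree."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, [])
            if targets:
                # Each outgoing link carries an equal share of the source's score.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its score over all nodes.
                share = damping * rank[source] / n
                for node in nodes:
                    new_rank[node] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A and B both link to C, and C links back to A.
scores = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this small graph C receives the most incoming weight and B, which nothing links to, receives the least.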

In some embodiments, taking an average can have adverse effects for high-quality providers with a small number of listings. In some cases, should a provider with a single most-popular listing add a second listing, their average could drop by as much as half. In some cases, simply summing popularity scores could encourage providers to prioritize quantity over quality. In some embodiments, a tiered approach to determining popularity can be used in which a tiered-sum approach is employed and each provider's score can be a weighted sum of the popularity scores of their listings.
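A tiered sum can be sketched by ranking a provider's listings by popularity and applying a smaller weight to each successive rank. In the Python sketch below, the geometric weights (1.0, 0.5, 0.25, ...) and the sample scores are assumptions chosen for illustration; the disclosure does not specify particular tier weights.

```python
from typing import Sequence


def tiered_sum(listing_scores: Sequence[float], decay: float = 0.5) -> float:
    """Weighted sum of a provider's listing scores, with weights shrinking by rank."""
    ordered = sorted(listing_scores, reverse=True)
    return sum(score * decay ** rank for rank, score in enumerate(ordered))


def average(listing_scores: Sequence[float]) -> float:
    return sum(listing_scores) / len(listing_scores)


one_hit = [100.0]
one_hit_plus_new = [100.0, 10.0]   # the provider adds a second, less popular listing
many_weak = [1.0] * 10             # quantity over quality

avg_before, avg_after = average(one_hit), average(one_hit_plus_new)            # 100.0, 55.0
tier_before, tier_after = tiered_sum(one_hit), tiered_sum(one_hit_plus_new)    # 100.0, 105.0
```

Under these assumed weights, adding a second listing no longer lowers the provider's score, while ten weak listings together still score below 2.0, so sheer volume is not rewarded.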

Furthermore, in some embodiments, these different components can be weighted. In some cases, a number of times a job has been run can be weighted higher than installs, which in turn can be weighted higher than views.
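The weighting of popularity components described above can be sketched as a dot product between a usage vector and a weight vector. The specific weights and counts in this Python example are hypothetical; only the ordering (jobs above installs above views) follows the text.

```python
from typing import Dict

# Assumed weights: jobs count more than installs, which count more than views.
WEIGHTS: Dict[str, float] = {"views": 1.0, "installs": 5.0, "jobs": 10.0}


def popularity_score(components: Dict[str, int], weights: Dict[str, float] = WEIGHTS) -> float:
    """Collapse the popularity vector into one value using per-component weights."""
    return sum(weights[name] * components.get(name, 0) for name in weights)


widely_viewed = {"views": 100, "installs": 10, "jobs": 2}
heavily_used = {"views": 20, "installs": 10, "jobs": 15}

viewed_score = popularity_score(widely_viewed)  # 100*1 + 10*5 + 2*10
used_score = popularity_score(heavily_used)     # 20*1 + 10*5 + 15*10
```

Keeping the components separate until this final step leaves room to filter or rank on any single component as well.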

In some cases, the popularity of a listing can be affected by time-decay. In some embodiments, time-decay allows a listing with 10 views that are five years old to be represented as less popular than a listing with 2 views today. In some embodiments, time-decay can be accomplished using a variety of functions, e.g., sliding window functions, hyperbolic functions, exponential functions, and linear decay functions.
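The example of ten old views versus two recent ones can be worked through with an exponential decay function, one of the options listed above. In this Python sketch the 180-day half-life is an assumed parameter, and each view is represented simply by its age in days.

```python
from typing import Iterable


def decayed_count(event_ages_days: Iterable[float], half_life_days: float = 180.0) -> float:
    """Sum of events where each event's weight halves every half_life_days."""
    return sum(0.5 ** (age / half_life_days) for age in event_ages_days)


old_listing = decayed_count([5 * 365] * 10)  # 10 views, each five years old
new_listing = decayed_count([0, 0])          # 2 views today
```

With this half-life the ten five-year-old views contribute far less than one current view in total, so the listing with two views today ranks as more popular.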

In some cases, presentation of “similar” recommendations can be achieved through a user interface (UI). “Similar” recommendations can be presented as such, or combined with buttons (links) that enable direct interaction (purchase, installation, etc.) without requiring a user to visit a dedicated webpage. In some embodiments, presentation can be performed through an exposed programmatic interface.

In the case of listings including one or more functions, similarity (or dissimilarity) of pairs of functions can be determined after inspecting and comparing various characteristics of the functions. These characteristics may be either static or dynamic.

Static characteristics can include the number and types of input arguments of the functions and the types of return values, a histogram of the commands in a function, the length of the functions in terms of lines of code or machine code once compiled, and the structural similarity of the control flow graphs. Control flow graphs can be graph representations of executable functions. For example, lines that are always executed consecutively are part of the same code block, but if-(else)-statements, for/while-loops, go-to statements, returns (function exits), and other such programming primitives may cause multiple possible paths of execution, or possible edges to other code blocks.

Dynamic characteristics can include the execution time or computer memory that functions require when applied on the same/similar data inputs, the distribution of produced values as applied to a series of different data inputs, or the actual code commands executed when same/similar values are input as parameters.

If listings contain multiple functions, different schemes can be employed to determine similarity or dissimilarity with other listings that also contain one or more functions. These schemes can include retaining the most or least similar pair, taking the average over all pairs, or taking a sum of similarity when each function is matched to its closest counterpart. Other approaches can also be applied. In some cases, the choice of approach can affect the final similarity rankings.
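The aggregation approaches named above can be compared side by side on a small matrix of pairwise function similarities. In the Python sketch below, the matrix values are made up, and rows are assumed to correspond to the functions of one listing and columns to the functions of the other.

```python
from typing import Dict, List


def aggregate(pairwise: List[List[float]]) -> Dict[str, float]:
    """Reduce a matrix of function-to-function similarities to listing-level scores."""
    flat = [score for row in pairwise for score in row]
    return {
        "most_similar": max(flat),
        "least_similar": min(flat),
        "average": sum(flat) / len(flat),
        # Match each function of the first listing to its closest counterpart.
        "best_match_sum": sum(max(row) for row in pairwise),
    }


# Hypothetical similarities between two functions in listing X and two in listing Y.
pairwise = [
    [0.9, 0.2],
    [0.4, 0.6],
]
scores = aggregate(pairwise)
```

Here the four approaches give 0.9, 0.2, 0.525, and 1.5 respectively, which shows how the choice of approach can change the resulting ranking.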

The calculated similarity information may be stored in the databases 311A-311C in storage platform 310 or in some other location. In some cases, the calculated similarity information may not be stored, but may rather be generated dynamically, upon user demand.

FIG. 4 is a flow diagram of a method 400 for determining the affinity or similarity of a set of listings, in accordance with some embodiments of the present disclosure. Method 400 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, a processor, a processing device, a central processing unit (CPU), a system-on-chip (SoC), etc.), software (e.g., instructions running/executing on a processing device), firmware (e.g., microcode), or a combination thereof. In some embodiments, the method 400 may be performed by processing device 305A of cloud deployment 305, as illustrated in FIG. 3.

At 405, the processing device 305A may calculate the affinity scores for and between the listings in the data exchange. In some cases, the calculation may be done offline and in a batch mode. In some cases, the calculations may be performed on-the-fly based on user activity. In some embodiments, the affinity scores can be stored as records in an affinity store 315, as shown in FIG. 3.

At 410, the processing device 305A may present a set of listings to a consumer. In some cases, this presentation may be made interactively through a user interface (UI). In some cases, this presentation may be made programmatically.

In this way, in some embodiments, consumers can be provided with a means of obtaining recommendations for “similar” listings that reflect their personalized criteria, the popularity of the listings, and/or the relevance of the listings, based on attributes of the listings themselves.

FIG. 5 illustrates a diagrammatic representation of a machine in the example form of a computer system 500 within which a set of instructions may reside, for causing the machine to perform any one or more of the methodologies discussed herein for determining the affinity of listings within a data exchange.

In alternative embodiments, the machine may be connected (e.g., networked) to other machines in a local area network (LAN), an intranet, an extranet, or the Internet. The machine may operate in the capacity of a server or a client machine in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, a hub, an access point, a network access control device, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. In one embodiment, computer system 500 may be representative of a server.

The exemplary computer system 500 includes a processing device 502, a main memory 504 (e.g., read-only memory (ROM), flash memory, dynamic random-access memory (DRAM), etc.), a static memory 505 (e.g., flash memory, static random-access memory (SRAM), etc.), and a data storage device 518, which communicate with each other via a bus 530. Any of the signals provided over various buses described herein may be time multiplexed with other signals and provided over one or more common buses. Additionally, the interconnection between circuit components or blocks may be shown as buses or as single signal lines. Each of the buses may alternatively be one or more single signal lines and each of the single signal lines may alternatively be buses.

Computing device 500 may further include a network interface device 507 that may communicate with a network 520. The computing device 500 also may include a video display unit 510 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and an acoustic signal generation device 515 (e.g., a speaker). In one embodiment, video display unit 510, alphanumeric input device 512, and cursor control device 514 may be combined into a single component or device (e.g., an LCD touch screen).

Processing device 502 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computer (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 502 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 502 is configured to execute affinity determination instructions 525, for performing the operations and steps discussed herein.

The data storage device 518 may include a machine-readable storage medium 528, on which is stored one or more sets of affinity determination instructions 525, e.g., software, embodying any one or more of the methodologies or functions described herein. The affinity determination instructions 525 may also reside, completely or at least partially, within the main memory 504 or within the processing device 502 during execution thereof by the computer system 500; the main memory 504 and the processing device 502 also constituting machine-readable storage media. The affinity determination instructions 525 may further be transmitted or received over a network 520 via the network interface device 507.

The machine-readable storage medium 528 may also be used to store instructions to perform the methods described herein. While the machine-readable storage medium 528 is shown in an exemplary embodiment to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) that store the one or more sets of instructions. A machine-readable medium includes any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read-only memory (ROM); random-access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or another type of medium suitable for storing electronic instructions.

Unless specifically stated otherwise, terms such as “receiving,” “routing,” “granting,” “determining,” “publishing,” “providing,” “designating,” “encoding,” or the like, refer to actions and processes performed or implemented by computing devices that manipulate and transform data represented as physical (electronic) quantities within the computing device's registers and memories into other data similarly represented as physical quantities within the computing device memories or registers or other such information storage, transmission or display devices. Also, the terms “first,” “second,” “third,” “fourth,” etc., as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation.

Examples described herein also relate to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computing device selectively programmed by a computer program stored in the computing device. Such a computer program may be stored in a computer-readable non-transitory storage medium.

The methods and illustrative examples described herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used in accordance with the teachings described herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description above.

The above description is intended to be illustrative, and not restrictive. Although the present disclosure has been described with references to specific illustrative examples, it will be recognized that the present disclosure is not limited to the examples described. The scope of the disclosure should be determined with reference to the following claims, along with the full scope of equivalents to which the claims are entitled.

As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.

It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times, or the described operations may be distributed in a system that allows the occurrence of the processing operations at various intervals associated with the processing.

Various units, circuits, or other components may be described or claimed as “configured to” or “configurable to” perform a task or tasks. In such contexts, the phrase “configured to” or “configurable to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task, or configurable to perform the task, even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the “configured to” or “configurable to” language include hardware, for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is “configured to” perform one or more tasks, or is “configurable to” perform one or more tasks, is expressly intended not to invoke 35 U.S.C. 112, sixth paragraph, for that unit/circuit/component. Additionally, “configured to” or “configurable to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. “Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks. “Configurable to” is expressly intended not to apply to blank media, an unprogrammed processor or unprogrammed generic computer, or an unprogrammed programmable logic device, programmable gate array, or other unprogrammed device, unless accompanied by programmed media that confers the ability to the unprogrammed device to be configured to perform the disclosed function(s).

Any combination of one or more computer-usable or computer-readable media may be utilized. For example, a computer-readable medium may include one or more of a portable computer diskette, a hard disk, a random-access memory (RAM) device, a read-only memory (ROM) device, an erasable programmable read-only memory (EPROM or Flash memory) device, a portable compact disc read-only memory (CDROM), an optical storage device, and a magnetic storage device. Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages. Such code may be compiled from source code to computer-readable assembly language or machine code suitable for the device or computer on which the code will be executed.

Embodiments may also be implemented in cloud computing environments. In this description and the following claims, “cloud computing” may be defined as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned (including via virtualization) and released with minimal management effort or service provider interaction and then scaled accordingly. A cloud model can be composed of various characteristics (e.g., on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service), service models (e.g., Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”)), and deployment models (e.g., private cloud, community cloud, public cloud, and hybrid cloud).

The flow diagrams and block diagrams in the attached figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flow diagrams or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams or flow diagrams, and combinations of blocks in the block diagrams or flow diagrams, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means that implement the function/act specified in the flow diagram and/or block diagram block or blocks.

The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

What is claimed is:
 1. A method comprising: determining a set of affinity metrics for a set of listings, each listing of the set of listings comprising data to be shared through a data exchange, wherein the set of affinity metrics includes a set of characteristics allowing identification of a listing with one or more characteristics in the set of characteristics; for each pair of listings of the set of listings: calculating an affinity score, using the set of affinity metrics; and storing a record in an affinity store, the record including the affinity score; and presenting one or more listings of the set of listings using the affinity score between the first listing of the set of listings and the one or more listings of the set of listings.
 2. The method of claim 1, wherein the record further includes: a calculation timestamp comprising a date and a time that the affinity score was calculated; an expiration time for the record; and a number of times the record has been used.
 3. The method of claim 2, wherein the affinity score is weighted at least in part on time decay of the calculation timestamp.
 4. The method of claim 1, wherein the affinity score is based at least in part on usage of a listing, the usage comprising at least one of: a number of views that the listing receives; a number of installations of the listing; or a number of executions of native applications within the listing.
 5. The method of claim 1, further comprising: calculating a distance score indicating dissimilar characteristics for the pair of listings of the set of listings, using the affinity metrics, for each pair of listings of the set of listings; and storing, in the record in the affinity store, the distance score.
 6. The method of claim 1, wherein the affinity metrics include SQL metadata from the set of listings.
 7. The method of claim 6, wherein the SQL metadata includes knowledge gained from machine learning classifiers.
 8. The method of claim 1, wherein the affinity score is calculated in response to a request for a presentation of listings similar to the first listing of the set of listings.
 9. The method of claim 1, wherein the affinity score is pre-computed prior to a request for listings similar to the first listing of the set of listings.
 10. A system comprising: a memory; and a processing device, operatively coupled to the memory, the processing device to: determine a set of affinity metrics for a set of listings, each listing of the set of listings comprising data to be shared through a data exchange, wherein the set of affinity metrics includes a set of characteristics allowing identification of a listing with one or more characteristics in the set of characteristics; for each pair of listings of the set of listings: calculate an affinity score, using the set of affinity metrics; and store a record in an affinity store, the record including the affinity score; and present one or more listings of the set of listings using the affinity score between the first listing of the set of listings and the one or more listings of the set of listings.
 11. The system of claim 10, wherein the affinity metrics include a number of times a listing of the set of listings has been referenced.
 12. The system of claim 10, wherein an order of presentation of the listings of the set of listings is strictly in order of affinity scores.
 13. The system of claim 10, wherein an order of presentation of the listings of the set of listings is based at least in part on a randomization of a fixed number of listings.
 14. The system of claim 10, wherein the affinity metrics include SQL metadata from the set of listings.
 15. The system of claim 10, wherein an input to the affinity metrics includes a structured field from the set of listings.
 16. The system of claim 10, wherein the affinity metrics support a tiered formula such that affinity metrics associated with listings of the set of listings associated with an organization are unaffected by a publication of new listings of the set of listings associated with the organization.
 17. The system of claim 10, wherein the record includes a plurality of affinity scores.
 18. A non-transitory computer-readable medium storing instructions thereon which, when executed by a processing device, cause the processing device to: determine a set of affinity metrics for a set of listings, each listing of the set of listings comprising data to be shared through a data exchange, wherein the set of affinity metrics includes a set of characteristics allowing identification of a listing with one or more characteristics in the set of characteristics; for each pair of listings of the set of listings: calculate an affinity score, using the set of affinity metrics; and store a record in an affinity store, the record including the affinity score; and present one or more listings of the set of listings using the affinity score between the first listing of the set of listings and the one or more listings of the set of listings.
 19. The non-transitory computer-readable medium of claim 18, wherein the instructions further cause the processing device to use natural language processing to extract information, as an input to the affinity metrics, from a description in a listing of the set of listings.
 20. The non-transitory computer-readable medium of claim 18, wherein the affinity metrics include a number of times a listing of the set of listings has been referenced.
 21. The non-transitory computer-readable medium of claim 18, wherein the affinity metrics include SQL metadata from the listings of the set of listings.
 22. The non-transitory computer-readable medium of claim 21, wherein the SQL metadata further includes static values that describe the listings of the set of listings.
 23. The non-transitory computer-readable medium of claim 21, wherein the SQL metadata further includes database table names and database column names.
 24. The non-transitory computer-readable medium of claim 18, wherein the record further includes: a calculation timestamp comprising a date and a time that the affinity score was calculated; an expiration time for the record; and a number of times the record has been used.
 25. The non-transitory computer-readable medium of claim 24, wherein the affinity score is weighted based at least in part on time decay of the calculation timestamp.
 26. The non-transitory computer-readable medium of claim 18, wherein the affinity metrics include SQL metadata from the listings of the set of listings.